MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes
Authors
Abstract
Fault tolerance is essential for distributed systems that require a reliable computing environment. Despite extensive research over two decades, practical fault-tolerance systems have not been provided, owing to the high overhead and unwieldiness of previous systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerance systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management for MPI, and message queue management. MPICH-GF requires no modification of application source code and affects MPICH communication as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the virtual device level. We have evaluated MPICH-GF with NPB applications on Globus middleware.
Similar resources
Checkpointing and Migration of parallel processes based on Message Passing Interface
This paper presents a Checkpoint-based Rollback Recovery and Migration System for Message Passing Interface, ChaRM4MPI, for Linux Clusters. Some important fault tolerant mechanisms are designed and implemented in this system, which include coordinated checkpointing protocol, synchronized rollback recovery, process migration, and so on. Owing to ChaRM4MPI, the node transient faults can be recove...
Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs
Existing rollback-recovery methods using consistent checkpointing may cause high overhead for applications that frequently send output to the “outside world,” since a new consistent checkpoint must be written before the output can be committed, whereas existing methods using optimistic message logging may cause large delays in committing output, since processes may buffer received messages arbi...
An Application-Transparent, Platform-Independent Approach to Rollback-Recovery for Mobile Agent Systems
This paper proposes a new approach to rollback-recovery for mobile-agent systems and describes its implementation in the MESSENGERS mobile agents system. The checkpointing method used makes it possible to implement space- and time-efficient, user-transparent rollback-recovery in heterogeneous distributed environments. Together with an efficient non-blocking system snapshot algorithm this checkpointing met...
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, nam...
MPI support on opportunistic grids based on the InteGrade middleware
The Message Passing Interface (MPI) is a popular programming model for parallel applications. Support for MPI in grid middleware is important for the widespread use of grids for parallel programming. This enables existing parallel applications to be executed on large-scale grids, as opposed to being restricted to local clusters. In the specific case of opportunistic grids, the use of idle compu...
Journal title:
- IEICE Transactions
Volume 87-D, Issue
Pages -
Publication date 2004